Optimal alphabet for single text compression
Authors
Abstract
A text written using symbols from a given alphabet can be compressed with the Huffman code, which minimizes the length of the encoded text. It is necessary, however, to employ a text-specific codebook, i.e. the symbol-codeword dictionary, to decode the original text. Thus, compression performance should be evaluated by the full code length, i.e. the length of the encoded text plus the length of the codebook. We studied several alphabets for compressing texts: letters, n-grams of letters, syllables, words, and phrases. If only sufficiently short texts are retained, an alphabet of letters or two-grams of letters is optimal. For the majority of Project Gutenberg texts, the best alphabet (the one that minimizes the full code length) is given by syllables or words, depending on the representation of the codebook. Letter 3- and 4-grams, having on average a length comparable to syllables/words, perform noticeably worse than syllables or words. Word 2-grams are also never the best alphabet, on account of having a very large codebook. We also show that the codebook representation is important: switching from a naive representation to a compact one significantly improves matters for alphabets with a large number of symbols, most notably words. Thus, meaning-expressing elements of the language (syllables or words) provide the best compression alphabet.
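The full code length described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 8 bits per character, and the naive codebook layout (each symbol spelled out character by character, followed by its codeword) are assumptions made here for illustration only, and the paper's naive and compact codebook representations may differ.

```python
import heapq
import re
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: codeword length in bits} for a Huffman code."""
    if len(freqs) == 1:
        # A one-symbol alphabet still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees puts every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next_id, s1 + s2))
        next_id += 1
    return lengths


def full_code_length(symbols, bits_per_char=8):
    """Return (encoded_bits, codebook_bits, total_bits) for a symbol sequence.

    Assumption: a naive codebook that stores each symbol at
    bits_per_char bits per character, plus the bits of its codeword.
    """
    freqs = Counter(symbols)
    lengths = huffman_code_lengths(freqs)
    encoded_bits = sum(freqs[s] * lengths[s] for s in freqs)
    codebook_bits = sum(len(s) * bits_per_char + lengths[s] for s in freqs)
    return encoded_bits, codebook_bits, encoded_bits + codebook_bits


def letters(text):
    """Letter alphabet: every character is one symbol."""
    return list(text)


def word_tokens(text):
    """Word-like alphabet: runs of word characters and runs of separators,
    so that the concatenated tokens reproduce the text exactly."""
    return re.findall(r"\w+|\W+", text)
```

Calling `full_code_length(letters(text))` and `full_code_length(word_tokens(text))` on the same text gives the two totals to compare. Under these assumptions a word-like alphabet shortens the encoded text but pays for a larger codebook, which is the trade-off the abstract describes.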
Similar resources
Text Compression Via Alphabet Re-Representation
This article introduces the concept of alphabet re-representation in the context of text compression. We consider re-representing the alphabet so that a representation of a character reflects its properties as a predictor of future text. This enables us to use an estimator from a restricted class to map contexts to predictions of upcoming characters. We describe an algorithm that uses this idea...
On Large Alphabet Compression
In this report, we present results in Large Alphabet Compression. We first show that the min-max redundancy of standard compression tends towards infinity for sufficiently large alphabets. With this, we motivate two other approaches that are employed in compressing large alphabets, namely pattern and shape compression. We then present upper and lower bounds on the min-max redundancy of the same.
A large-alphabet-oriented scheme for Chinese and English text compression
In this paper, a large-alphabet-oriented scheme is proposed for both Chinese and English text compression. Our scheme parses Chinese text with the alphabet defined by Big-5 code, and parses English text with some rules designed here. Thus, the alphabet used for English is not a word alphabet. After the text is parsed into tokens, zero, first, and second order Markov models are used to estimate the occu...
Alphabet Permutation for Differentially Encoding Text
One degree of freedom which is usually not exploited in developing high-performance text-processing algorithms is the encoding of the underlying atomic character set. Typically, standard character encodings such as ASCII or Unicode are assumed to be a fixed fact of nature, and indeed for most classical string algorithms the assignment of exactly which symbol maps to which k-length bit pattern ap...
Optimal Alphabet Partitioning for Semi-Adaptive Coding
Practical applications that employ entropy coding for large alphabets often partition the alphabet set into two or more layers and encode each symbol by using some suitable prefix coding for each layer. In this paper we formulate the problem of optimal alphabet partitioning for the design of a two layer semi-adaptive code and give a solution based on dynamic programming. However, the complexity...
Journal
Journal title: Information Sciences
Year: 2023
ISSN: 0020-0255, 1872-6291
DOI: https://doi.org/10.1016/j.ins.2022.10.104